Self-Monitoring Assessments for Educational Accountability Systems
Abstract
Test-based accountability is now the cornerstone of U.S. education policy, and it is becoming more important in many other nations as well. Educators sometimes respond to test-based accountability in ways that produce score inflation. In the past, score inflation has usually been evaluated by comparing trends in scores on a high-stakes test to trends on a lower-stakes audit test. However, separate audit tests are often unavailable, and their use has several important drawbacks, such as potential bias from motivational differences. As an alternative, we propose self-monitoring assessments (SMAs) that incorporate audit components into operational high-stakes assessments. This paper provides a framework for designing SMAs. It describes five specific SMA designs that could be incorporated into the non-equivalent-groups anchor test linking approaches used by most large-scale assessments and discusses analytical issues that would arise in their use.

Review draft: 26 June 2010

In recent decades, test-based accountability (TBA) has become the cornerstone of U.S. education policy. Pressure on educators to raise scores has increased from one wave of reforms to the next. TBA, well established for some time in the U.S. and England, is now appearing in many other nations as well. The net effects of these TBA policies, and particularly variations in the net effects across types of schools, students, tests, and accountability systems, remain uncertain. However, research has made clear that in their attempts to raise scores, educators often resort to undesirable strategies. These include focusing too narrowly on tested content and providing test preparation that capitalizes on substantively unimportant aspects of the test, such as format or unimportant features of scoring rubrics (e.g., Stecher, 2002; Stecher & Mitchell, 1995).
These responses can undermine the test's representativeness of the domain and thereby produce score inflation, i.e., increases in scores that are larger than improvements in mastery of the domain would warrant. Research has shown that this inflation can be very large (Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz & Barron, 1998; Koretz, Linn, Dunbar, & Shepard, 1991). These distortions should not be surprising. They are a manifestation of Campbell's Law: "The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor" (Campbell, 1979, p. 87). The distortions described by Campbell have been documented in a wide variety of fields other than education (e.g., Rothstein, 2008).

As Feuer (this volume) and others have argued, Campbell's Law is not in itself reason to avoid performance-based accountability systems. Even with distortions, the net effect of such a system can be strongly positive. However, the presence of severe distortions, such as those found in research on test-based accountability programs, does warrant a response. Feuer (this volume) notes two responses: better evaluation of the effects of the program, and efforts to design the programs to minimize unwanted effects and maximize positive effects.

There are two distinct but overlapping strategies for better design. The first is to design testing programs to make them less vulnerable to undesirable behavioral responses and score inflation. The second is to design the accountability programs into which tests are embedded to lessen such responses, for example, by giving substantial weight to factors other than increases in standardized test scores, e.g., judgments from an inspectorate or non-test-based indicators such as dropout rates.
In this paper, we address only the former. We suggest an approach to test design that should both decrease the incentives to prepare students inappropriately and facilitate better evaluation of the effects of accountability, both positive and negative.

Evaluating Score Inflation

In most of the research to date, score inflation has been evaluated by comparing trends on a high-stakes test to trends on an audit test, a low- or lower-stakes test intended to measure a reasonably similar domain of achievement. In the U.S., the National Assessment of Educational Progress (NAEP) has been used most often as an audit because it is a broad measure, reflects a degree of consensus about the goals of education, and is rarely the focus of explicit test preparation. However, some districts and states administer a second, lower-stakes test, and these have been used as audit tests in a few studies (Jacob, 2005; Schemo & Fessenden, 2003).

This approach has several important limitations. Suitable audit tests are often unavailable or infrequent. For example, NAEP is administered in only three grades, and not every year. When data from a second test are available, its suitability as an audit may be limited; for example, the substantive appropriateness of the second test may be arguable. Even when their substantive appropriateness is clear, audit tests are generally not on the same scale as the high-stakes tests to which they are compared. In addition, if students know that the audit test has no consequences, the comparison may be confounded with motivational effects. While differences in motivation are less problematic for comparisons of trends than for cross-sectional comparisons, they are nonetheless potentially a problem for the former as well. Student-level exclusions may differ between the high-stakes and audit tests, and may change differentially over time, biasing comparisons between them.
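Because the two tests are generally not on the same scale, trend comparisons of this kind are usually made in standard-deviation (effect-size) units. The following sketch illustrates the logic with entirely hypothetical score data; the function name and the five-year figures are ours, not drawn from any study cited here.

```python
# Illustrative sketch (hypothetical data): comparing gain trends on a
# high-stakes test and a lower-stakes audit test, each expressed in
# standard-deviation units so the two tests' different scales can be compared.

def annual_gain_in_sd_units(scores, sd):
    """Average year-to-year gain, divided by the test's score SD."""
    gains = [(b - a) / sd for a, b in zip(scores, scores[1:])]
    return sum(gains) / len(gains)

# Hypothetical mean scale scores over five years.
high_stakes = [300, 312, 322, 330, 338]   # SD = 40 scale points
audit       = [250, 252, 255, 257, 258]   # SD = 35 scale points

hs_trend = annual_gain_in_sd_units(high_stakes, sd=40)
audit_trend = annual_gain_in_sd_units(audit, sd=35)

# A large positive gap between the two trends is consistent with inflation,
# though, as noted above, motivation and exclusion differences can confound it.
inflation_estimate = hs_trend - audit_trend

print(f"high-stakes trend:  {hs_trend:.3f} SD/yr")
print(f"audit trend:        {audit_trend:.3f} SD/yr")
print(f"apparent inflation: {inflation_estimate:.3f} SD/yr")
```

In this toy example, the high-stakes trend is roughly four times the audit trend, the kind of divergence that the studies cited above treat as evidence of inflation.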
When audit tests are administered only to samples of schools, there is a risk of accidental or intentional differences in samples over time. Creating and administering a new audit test rather than relying on extant measures would address only some of these limitations.

We propose an alternative to separate audit tests: self-monitoring assessments (SMAs). SMAs would incorporate one or more audit components into the operational forms of high-stakes tests. In some but not all cases, this approach would allow the audit and high-stakes measures to be placed on the same scale, and it would also address the other limitations of separate audit tests noted above.

This paper describes methods for designing and using SMAs. The following section briefly sketches the framework for evaluating the validity of inferences under high-stakes conditions detailed by Koretz and colleagues (Koretz, McCaffrey, & Hamilton, 2001; Koretz & Hamilton, 2006), which undergirds our design for SMAs. The paper then sketches several different designs for SMAs. A subsequent section explores some technical issues that these designs raise for analysis. A final section discusses implications and unresolved issues.

A Psychometric Framework for SMAs

In the United States, the evolution of test-based accountability has contributed to a number of innovations in test design. These include the development of a variety of performance assessments and other approaches to cognitively richer assessment; the development of 'standards-based' tests aligned with states' content and performance standards; innovations in methods for setting performance standards; and advances in growth modeling. However valuable, none of these developments directly addresses what we consider to be the core problem underlying score inflation: predictable recurrences of substantive and nonsubstantive sampling in the design and construction of tests.
To evaluate the potential for score inflation, it is necessary to specify which aspects of sampling can offer the potential for inappropriate test preparation. Koretz et al. (2001, 2006) suggest that for evaluating the validity of inferences under high-stakes conditions (VIHS), we view a test as a collection of performance elements, a deliberately vague term that denotes all aspects of performance that contribute either to performance on a test or to the inferences about achievement that are based on it. Substantive elements are those that are relevant to the inference based on scores. Non-substantive elements are not the focus of inference. Koretz and Hamilton (2006, p. 545) give the example of "facility with a particular format that is of no particular importance for the intended inference."

It is important to note, however, that non-substantive elements include far more than item format, which is typically used to refer only to multiple-choice, short constructed-response, and so on. For example, some problems in elementary algebra may be represented verbally, algebraically, graphically, or pictorially. This choice, which may be of no substantive importance, may be used repeatedly. A test writer may consistently present graphs of linear equations in one variable with positive slopes and positive intercepts, even if the signs of the slope and intercept are of no substantive importance for the inference. It is therefore more accurate to refer to non-substantive elements as aspects of item style rather than format. Response demands may also be both substantive and non-substantive.
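The linear-equation example suggests one concrete countermeasure: if non-substantive style choices are sampled at random across forms rather than fixed, they become unpredictable and harder to coach. The following is a minimal illustrative sketch of that idea; the item model and all parameter values are hypothetical, not taken from any operational test.

```python
# Illustrative sketch (hypothetical item model): randomizing a non-substantive
# style feature, here the signs of the slope and intercept of a graphed
# linear equation, so that "always positive slope" is no longer a pattern
# that test preparation can exploit.

import random

def draw_linear_item(rng):
    """Draw slope and intercept magnitudes and, independently, their signs."""
    slope = rng.choice([1, 2, 3]) * rng.choice([-1, 1])
    intercept = rng.choice([0, 1, 2, 4]) * rng.choice([-1, 1])
    return slope, intercept

rng = random.Random(0)  # fixed seed so the sketch is reproducible
items = [draw_linear_item(rng) for _ in range(1000)]

# Roughly half the drawn slopes should be negative, so the style feature
# carries no predictable pattern across forms.
share_negative = sum(1 for slope, _ in items if slope < 0) / len(items)
print(f"share of items with negative slope: {share_negative:.2f}")
```

The same principle applies to any recurrent style choice: the design decision is to treat it as a random draw from a specification, not a fixed habit of a test writer.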
Scoring rubrics often provide opportunities for coaching that Stecher and Mitchell (1995) called "teaching to the rubric." As one teacher described it, "What's in the rubrics gets done, and what isn't doesn't." Stecher and Mitchell noted that this "may cause teachers to neglect important...skills not addressed by the rubrics and neglect tasks not well aligned to [them]" (1995, p. ix). The choice among rubrics, however, is often at least in part non-substantive.

All of these substantive and non-substantive design decisions can affect scores. Koretz et al. (2001) define an element's effective test weight as the sensitivity of the score to change in performance on that element. They assume no specific model for compositing performance on elements into scores. If the test score ζ is any function of the performance elements π_i,
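The verbal definition above can be written compactly. The following is our sketch of the notation (the symbols ζ and π_i come from the text; the function f and the partial-derivative form of the weight are our gloss, not necessarily the authors' exact formulation):

```latex
% Score as an arbitrary function of the performance elements:
\zeta = f(\pi_1, \pi_2, \ldots, \pi_n)

% One natural formalization of the effective test weight of element i:
% the sensitivity of the score to a change in performance on that element.
w_i = \frac{\partial \zeta}{\partial \pi_i}
```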